In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
The first step is to load the training and test data that were saved when running the Training and test feature set generation notebook.
In [2]:
cd ../../features/
In [3]:
posdata = np.loadtxt("human.iRefIndex.positive.vectors.txt", delimiter="\t", dtype="str")
Training with the full negative set is impossible on this computer because the file cannot be loaded into RAM. To get around this, we can use a negative training set only 10 times larger than the positive training set:
In [4]:
negdata = np.loadtxt("human.iRefIndex.negative.vectors.txt", delimiter="\t", dtype="str")
There are some conflicting entries in this data, detected during the training and test set generation. These can be removed using their indexes:
In [5]:
X = np.concatenate((posdata,negdata))
In [6]:
# indexes of rows that contain no "missing" values
plottablerows = X == "missing"
plottablerows = np.where(plottablerows.max(axis=1) == False)
Unfortunately, some "NA" strings have sneaked in somehow, so we will have to zero these. This appears to be due to a feature in Gene_Ontology:
In [7]:
NAinds = np.where(X=="NA")
In [8]:
X[NAinds] = 0.0
In [9]:
X[X=="missing"] = np.nan
X = X.astype(np.float)
Finally we can create the target vector y from what we know about the lengths of the positive and negative sets:
In [10]:
#create the target y vector
y = np.array([1]*len(posdata) + [0]*len(negdata))
We would like the training set we use to reflect our prior estimate that only one in every 600 combinations of protein identifiers is a true interaction. To ensure that our training set reflects this, we need to sample from it so as to maintain this proportion of positive to negative examples.
An efficient way to do this is to create a modified y vector which is in the correct proportion and maps back to the original y vector. Then, after calling StratifiedShuffleSplit, the returned indices can be transformed back to the correct indices on the original y vector:
In [11]:
from random import shuffle

def propy(y, samplesize=20000):
    """Function to produce proportional samples of y from the full dataset.
    Takes as input: y - the vector of class labels, samplesize - size of sample.
    Outputs: mody - modified y vector, transformdict - dictionary mapping indices
    of mody back to indices of the original y."""
    enuy = list(enumerate(y))
    shuffle(enuy)
    # one positive for every 599 negatives
    nones = int(samplesize/600)
    nzeros = int((samplesize*599)/600)
    transformdict = {}
    mody = []
    # iterate over the shuffled list and pull out the required number of ones and zeros
    for i, label in enuy:
        if label == 1 and nones > 0:
            transformdict[len(mody)] = i
            mody.append(label)
            nones -= 1
        elif label == 0 and nzeros > 0:
            transformdict[len(mody)] = i
            mody.append(label)
            nzeros -= 1
        if nzeros == 0 and nones == 0:
            break
    return mody, transformdict
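As a minimal usage sketch of the mapping described above (illustrative only; the sample and split sizes here are arbitrary and this particular split is not used later in the notebook), the indices returned by StratifiedShuffleSplit on the proportional sample can be translated back to positions in the original y vector through transformdict:

import numpy as np
import sklearn.cross_validation

mody, tfdict = propy(y, samplesize=20000)
splits = sklearn.cross_validation.StratifiedShuffleSplit(mody, n_iter=1, test_size=0.25)
train, test = next(iter(splits))
# translate sample indices back to positions in the original y vector
origtrain = np.array([tfdict[i] for i in train])
origtest = np.array([tfdict[i] for i in test])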
Then all that's required is to create a sub-sampled, but still large, version of the X and y vectors:
In [12]:
mody,tfdict = propy(y,samplesize=400000)
In [13]:
propindexes = np.array(tfdict.values())
X, y = X[propindexes], y[propindexes]
In [14]:
len(X),len(y)
Out[14]:
Each classifier will be tested in the following ways: a learning curve to choose a workable training set size, a randomized grid search over its main parameters, ROC and precision-recall curves, and a comparison against a dummy classifier.
The importance of different features will also be tested by eliminating features and repeating the tests above.
Finally, as a sanity check, we will look at the model's predictions for some well-known protein interactions.
Logistic regression is a simple classifier which passes a linear transformation of the input variables through a sigmoid function to produce a probability.
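Concretely, in the standard formulation (not anything specific to this notebook), the predicted probability of an interaction for a feature vector $x$ is
$$ P(y=1 \mid x) = \sigma(w^{T}x + b) = \frac{1}{1 + e^{-(w^{T}x + b)}} $$
where the weight vector $w$ and bias $b$ are learned from the training data.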
In [15]:
cd tmp/
In [16]:
import sklearn.linear_model
In [17]:
import sys, os
sys.path.append(os.path.abspath("../../opencast-bio/"))
In [18]:
import ocbio.model_selection, ocbio.mmap_utils
reload(ocbio.model_selection)
reload(ocbio.mmap_utils)
Out[18]:
In [19]:
!ipcluster stop
In [20]:
!ipcluster start -n=10 --daemon
In [21]:
from IPython.parallel import *
In [22]:
client = Client(profile='default')
In [23]:
#initialise load balanced distributed processing
lb_view = client.load_balanced_view()
In [17]:
import sklearn.pipeline, sklearn.preprocessing
In [25]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.linear_model.LogisticRegression()
model = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
We would like to plot a learning curve to get an idea of the training set size required, with k-fold validation, to get valid results. Until we decide on a final classifier it is preferable to be able to run fast iterations over different parameter settings than to absolutely maximise performance. A learning curve plots the performance of the algorithm on the training and test sets for different training set sizes.
As with the imported code, much of this code is taken from the excellent parallel machine learning tutorial by Olivier Grisel.
In [49]:
trainsizes = np.logspace(3, 5, 5)
#initialise and start run
lcsearch = ocbio.model_selection.LearningCurve(lb_view)
lcsearch.launch_for_arrays(model,X,y,trainsizes,n_cv_iter=5,params={'cls__C':0.316}, name="lrlc")
Out[49]:
In [53]:
print lcsearch
In [54]:
lrlc = lcsearch.plot_curve()
In [55]:
savez("../../plots/bayes/logistic.regression.learning.curve.npz",lrlc)
This shows that we can run the grid search with only 40,000 to 60,000 samples and still get meaningful results. When the grid search returns a best combination of parameters we will have to plot this learning curve again to make sure that the results are still valid. In the meantime, this will speed up our grid search considerably.
In [109]:
!ipcluster stop --profile=altcluster
In [110]:
!ipcluster start -n=10 --profile=altcluster --daemon
In [111]:
altclient = Client(profile='altcluster')
In [112]:
alb_view = altclient.load_balanced_view()
In [67]:
#make sure the processors aren't busy doing anything else:
alb_view.abort()
#initialise parameters to test:
lr_params = {
    'cls__C': np.logspace(-3, 1, 7)
}
In [68]:
#initialise mmaped data - had to increase CV iterations to reduce output variance
split_filenames = ocbio.mmap_utils.persist_cv_splits(X, y, test_size=10000, train_size=40000,
                                                     n_cv_iter=10, name='testfeatures', random_state=42)
In [69]:
#initialise and start run
lrsearch = ocbio.model_selection.RandomizedGridSearch(alb_view)
lrsearch.launch_for_splits(model,lr_params,split_filenames)
Out[69]:
In [76]:
print lrsearch.report()
With this data only one in every 18 examples will be positive, so the expected accuracy of a classifier that only predicts zeros is:
$$ 1 - \frac{1}{18} = \frac{17}{18} \approx 0.944 $$
So this model is outperforming the hypothetical all-zeros classifier.
We can implement this all zeros classifier and use it to create a comparative measure to make it easier to discriminate between classifiers.
In [62]:
import sklearn.dummy
In [63]:
def dummycompare(X, y, train, test, best_score):
    """Compare best_score against a classifier that always predicts the most frequent class."""
    dummy = sklearn.dummy.DummyClassifier(strategy="most_frequent")
    dummy.fit(imputer.fit_transform(X[train]), y[train])
    score = dummy.score(imputer.fit_transform(X[test]), y[test])
    if score > best_score:
        # the model failed to beat the dummy classifier
        return 0.0, 0.0
    else:
        # using a logit function here on a biased and scaled best score
        best_score = (best_score - score)*(1.0/(1.0 - score))
        return best_score, np.log(best_score/(1.0 - best_score))
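In other words, writing $a$ for the model's score and $a_{0}$ for the dummy classifier's score, the comparative measure rescales the improvement over the dummy onto the unit interval and then applies a logit:
$$ s = \frac{a - a_{0}}{1 - a_{0}}, \qquad \mathrm{logit}(s) = \ln\frac{s}{1 - s} $$
Larger values indicate a bigger improvement over simply predicting the majority class; zeros are returned if the model fails to beat the dummy.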
In [83]:
import sklearn.cross_validation

results = np.zeros([5, 2])
for i, (train, test) in enumerate(sklearn.cross_validation.StratifiedShuffleSplit(y, n_iter=5,
                                                                                  test_size=10000, train_size=40000)):
    results[i, :] = dummycompare(X, y, train, test, lrsearch.find_bests()[0][0])
In [84]:
print np.mean(results, axis=0)
A ROC curve indicates the trade-offs this classifier offers between the true positive rate and the false positive rate as the decision threshold is varied. In some applications the price of a high false positive rate is acceptable in order to obtain a high true positive rate. As the false positive rate approaches 1.0, though, the true positive rate inevitably approaches 1.0 for any classifier, because all samples are being classified as positive.
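For reference, the two rates are defined in the standard way:
$$ \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} $$
where $TP$, $FP$, $TN$ and $FN$ count the true/false positives and negatives at a given decision threshold.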
In [26]:
import sklearn.metrics

def plotroc(model, Xtest, ytest):
    """Plot the ROC curve for a fitted model on a test set and return the curve."""
    estimates = model.predict_proba(Xtest)
    fpr, tpr, thresholds = sklearn.metrics.roc_curve(ytest, estimates[:, 1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic AUC: {0}'.format(
        sklearn.metrics.roc_auc_score(ytest, estimates[:, 1])))
    plt.show()
    return fpr, tpr
In [30]:
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=0.25)))
In [89]:
bestparams = lrsearch.find_bests()[0][-1]
In [90]:
%%time
model.set_params(**bestparams)
model.fit(X[train],y[train])
fpr,tpr = plotroc(model,X[test],y[test])
We can see that the AUC score is greater than 0.5, which means that the classifier is performing better than chance.
In [91]:
savez("../../plots/bayes/logistic.regression.roc.npz",fpr,tpr)
In [92]:
print dummycompare(X,y,train,test,model.score(X[test],y[test]))
Precision-recall curves plot precision against recall as the decision threshold is varied.
The functions to compute them are implemented in Scikit-learn.
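For reference, the standard definitions are:
$$ \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN} $$
so precision measures what fraction of predicted interactions are real, while recall measures what fraction of the real interactions are recovered.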
In [21]:
def drawprecisionrecall(X, y, test, model):
    try:
        yscore = model.decision_function(X[test])
    except AttributeError:
        yscore = model.predict_proba(X[test])[:, 1]
    precision, recall, _ = sklearn.metrics.precision_recall_curve(y[test], yscore)
    average_precision = sklearn.metrics.average_precision_score(y[test], yscore,
                                                                average="micro")
    plt.clf()
    plt.plot(recall, precision)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve, AUC: {0:0.2f}'.format(average_precision))
    plt.show()
    return precision, recall
In [99]:
precision,recall = drawprecisionrecall(X,y,test,model)
This shows that it is not possible, even at low precision, to recall all of the positive training examples. Unfortunately, less than 10% can be recalled with perfect precision.
In [100]:
savez("../../plots/bayes/lr_precisionrecall.npz",precision,recall)
In [101]:
plt.plot(model.named_steps['cls'].coef_.T)
Out[101]:
In [102]:
savez("../../plots/bayes/logistic.regression.coef.npz",model.named_steps['cls'].coef_.T)
Which feature is the largest peak in the above plot?
In [103]:
np.where(model.named_steps['cls'].coef_.T == np.amax(model.named_steps['cls'].coef_.T))
Out[103]:
The largest peak in the above corresponds to a Gene Ontology feature. However, many of the features are almost as strongly weighted, showing that the classifier is not relying on a single feature.
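As a purely hypothetical sketch (the list featurenames below is not built anywhere in this notebook; it stands in for the column labels of X, in order), the most heavily weighted features could be listed like this:

import numpy as np
# Hypothetical: featurenames is assumed to hold one label per column of X.
coefs = model.named_steps['cls'].coef_.ravel()
order = np.argsort(np.abs(coefs))[::-1]
for idx in order[:10]:
    print featurenames[idx], coefs[idx]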
A support vector machine is a kernel-based classifier which fits a hyperplane to separate the training examples into different classes, maximising the distance from the hyperplane to the nearest training points in the high-dimensional space induced by the kernel.
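In its kernelised form (the standard formulation, not specific to this analysis) the decision function for a new example $x$ is
$$ f(x) = \operatorname{sign}\left(\sum_{i} \alpha_{i} y_{i} K(x_{i}, x) + b\right) $$
where the sum runs over the support vectors $x_{i}$, $K$ is the kernel (e.g. RBF, polynomial or sigmoid, as tried in the grid search below) and the coefficients $\alpha_{i}$ are learned during training.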
Optimal performance of an SVM with the default parameters depends on the scaling of the data, so we build a pipeline for the model as before with logistic regression:
In [102]:
import sklearn.svm
In [103]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.svm.SVC()
svmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
In [124]:
trainsizes = np.logspace(3, 4.5, 5)
#initialise and start run
lcsearch = ocbio.model_selection.LearningCurve(lb_view)
lcsearch.launch_for_arrays(svmodel,X,y,trainsizes,n_cv_iter=3,params={'cls__C':1.0}, name="lrlc")
Out[124]:
In [127]:
print lcsearch
In [126]:
svmlc = lcsearch.plot_curve()
In [108]:
savez("../../plots/bayes/svmlc.npz",svmlc)
From this curve we can conclude that at least 20,000 training samples are going to be necessary for meaningful results.
In [128]:
#initialise parameters to test:
svm_params = {
    'cls__C': np.logspace(-1, 1, 3),
    'cls__kernel': ['rbf', 'poly', 'sigmoid'],
    'cls__gamma': np.logspace(-1, 1, 3)
}
In [129]:
#initialise mmaped data
split_filenames = ocbio.mmap_utils.persist_cv_splits(X, y, test_size=2000, train_size=10000,
                                                     n_cv_iter=3, name='testfeatures', random_state=42)
In [117]:
#initialise and start run
search = ocbio.model_selection.RandomizedGridSearch(alb_view)
search.launch_for_splits(svmodel,svm_params,split_filenames)
Out[117]:
In [130]:
print search
This is taking an extremely long time to train and is not feasible at the sample sizes above. To quote Murphy's textbook on SVMs:
SVMs also take $O(N^{3})$ time to train...
This point is also made in the documentation for the Scikit-learn SVM classifier.
Because of this, SVMs do not seem well suited to this data, where we will primarily be working with very large sample sizes.
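To put that scaling in perspective (rough arithmetic, ignoring constant factors): going from the 10,000 training samples used in the grid search above to the 100,000 samples used elsewhere in this notebook would multiply the training time by roughly
$$ \left(\frac{100000}{10000}\right)^{3} = 10^{3} = 1000. $$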
In [131]:
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=2000, train_size=10000)))
print len(train), len(test)
In [132]:
svmparams = search.find_bests()[0][-1]
svmparams['cls__probability'] = True
In [133]:
%%time
svmodel.set_params(**svmparams)
svmodel.fit(X[train],y[train])
fpr,tpr = plotroc(svmodel,X[test],y[test])
That is an extremely poor ROC curve, probably because of the very small training set. It is unavoidable, though, given the time it takes to train support vector machines.
In [136]:
cd tmp
In [137]:
savez("../../plots/bayes/svm.roc.npz",fpr,tpr)
In [138]:
svmprecision,svmrecall = drawprecisionrecall(X,y,test,svmodel)
In [139]:
savez("../../plots/bayes/svm.precisionrecall.npz",svmprecision,svmrecall)
Random forest classifiers have proven to be effective in the task of protein interaction prediction in the past. A random forest classifier combines the results of many decision tree classifiers. In this way, it can capture correlations in the data that simpler methods, such as logistic regression, cannot.
The combination of decision trees is implemented in Scikit-learn by averaging the results of the individual trees.
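As a minimal sketch on toy data (not the notebook's feature vectors), the forest's predicted class probabilities are simply the mean of the individual trees' predicted probabilities:

import numpy as np
import sklearn.ensemble

# toy data purely to illustrate the averaging
rng = np.random.RandomState(0)
Xtoy = rng.rand(200, 5)
ytoy = (Xtoy[:, 0] + Xtoy[:, 1] > 1.0).astype(int)

forest = sklearn.ensemble.RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(Xtoy, ytoy)

# average the per-tree class probabilities and compare with the forest's output
treemean = np.mean([tree.predict_proba(Xtoy) for tree in forest.estimators_], axis=0)
print np.allclose(treemean, forest.predict_proba(Xtoy))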
As above we will define the pipeline with the same scaling and imputation:
In [15]:
import sklearn.ensemble
In [18]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.ensemble.RandomForestClassifier()
rfmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
In [125]:
trainsizes = np.logspace(3, 5, 7)
#initialise and start run
rflcsearch = ocbio.model_selection.LearningCurve(alb_view)
rflcsearch.launch_for_arrays(rfmodel, X, y, trainsizes, n_cv_iter=3,
                             params={'cls__n_estimators':100, 'cls__bootstrap':False},
                             name="lrlc")
Out[125]:
In [135]:
print rflcsearch
In [136]:
rflc = rflcsearch.plot_curve()
Training a single model with 100,000 samples to see how long each of these runs is going to take:
In [137]:
savez("../../plots/bayes/random.forest.learning.curve.npz",rflc)
The major parameters to vary with a Random Forest classifier are:
- n_estimators - the number of trees in the forest
- max_features - the size of the random subsets of features to consider when splitting a node
According to the documentation, a good value for max_features is the square root of the number of features, in our case approximately 15.
As for the number of trees in the forest, more trees give better performance, but the model takes longer to train.
We will try up to a maximum of 200 estimators to see at which point there are no longer significant gains from increasing the number of estimators.
Also, the random trees use bootstrapping by default when training, which is unnecessary on a dataset as large as this.
Therefore, we should set bootstrap=False.
In [36]:
split_filenames = ocbio.mmap_utils.persist_cv_splits(X, y, test_size=20000, train_size=100000,
                                                     n_cv_iter=3, name='rffeatures', random_state=42)
In [37]:
rfparams = {'cls__n_estimators':100,'cls__bootstrap':False}
In [53]:
%%time
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=20000, train_size=100000)))
rfmodel.set_params(**rfparams)
rfmodel.fit(X[train],y[train])
print "Training score: {0}".format(rfmodel.score(X[train],y[train]))
print "Test score: {0}".format(rfmodel.score(X[test],y[test]))
In [38]:
rf_params = {
    'cls__n_estimators': np.logspace(1, 2.3, 5).astype(np.int),
    'cls__max_features': np.arange(5, 30, 5).astype(np.int),
    'cls__bootstrap': [False]
}
In [39]:
rfsearch = ocbio.model_selection.RandomizedGridSearch(lb_view)
rfsearch.launch_for_splits(rfmodel,rf_params,split_filenames)
Out[39]:
In [51]:
print rfsearch
Training the model we would like to use on a large training set to get the best possible performance:
In [52]:
rfparams = rfsearch.find_bests()[0][-1]
In [20]:
%%time
train, test = next(iter(sklearn.cross_validation.StratifiedShuffleSplit(y, test_size=20000, train_size=200000)))
rfmodel.set_params(**rfparams)
rfmodel.fit(X[train],y[train])
print "Training score: {0}".format(rfmodel.score(X[train],y[train]))
print "Test score: {0}".format(rfmodel.score(X[test],y[test]))
In [70]:
rfroc = plotroc(rfmodel,X[test],y[test])
In [71]:
savez("../../plots/bayes/random.forest.roc.npz",rfroc)
In [22]:
rfprec,rfrecall = drawprecisionrecall(X,y,test,rfmodel)
In [26]:
!git annex unlock ../../plots/bayes/random.forest.precisionrecall.npz
In [27]:
savez("../../plots/bayes/random.forest.precisionrecall.npz",rfprec,rfrecall)
Finally, we can compare accuracy to the dummy case:
In [75]:
dummycompare(X,y,train,test,rfsearch.find_bests()[0][0])
Out[75]:
Scikit-learn also provides an alternative to the random forest in which the thresholds for candidate features are chosen at random rather than optimised. This reduces the variance of the model at the cost of an increase in bias. These are called Extremely Randomized Trees.
In [48]:
imputer = sklearn.preprocessing.Imputer(strategy='mean',missing_values="NaN")
scaler = sklearn.preprocessing.StandardScaler()
classifier = sklearn.ensemble.ExtraTreesClassifier()
erfmodel = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('cls',classifier)])
In [49]:
extrasearch = ocbio.model_selection.RandomizedGridSearch(alb_view)
extrasearch.launch_for_splits(erfmodel,rf_params,split_filenames)
Out[49]:
In [50]:
print extrasearch
The performance does not appear to be much better than the random forest. For this reason, it is probably not worth using this classifier over a simple random forest.
In [76]:
importances = rfmodel.named_steps['cls'].feature_importances_
In [77]:
plt.plot(importances)
Out[77]:
In [78]:
savez("../../plots/bayes/random.forest.importances.npz",importances)
In [80]:
cd ..
In [82]:
import pickle
In [83]:
f = open("random.forest.bayes.classifier.pickle","wb")
pickle.dump(rfmodel,f)
f.close()
In [85]:
edgelist = np.loadtxt("human.activezone.txt", dtype=str)
NAinds = np.where(edgelist == "NA")
edgelist[NAinds] = 0.0
edgelist[edgelist == "missing"] = np.nan
edgelist = edgelist.astype(np.float)
In [86]:
estimates = rfmodel.predict_proba(edgelist)
In [87]:
print estimates[5,1]
In [90]:
import csv
In [91]:
f, fw = open("../forGAVIN/mergecode/OUT/edgelist.txt"), open("../forGAVIN/mergecode/OUT/edgelist_weighted.txt", "w")
c, cw = csv.reader(f, delimiter="\t"), csv.writer(fw, delimiter="\t")
# skip the header line
c.next()
for i, l in enumerate(c):
    cw.writerow(l + [estimates[i, 1]])
f.close(), fw.close()
Out[91]:
In [93]:
oneindexes = np.where(y == 1)
zeroindexes = np.where(y == 0)
In [94]:
onesamples = rfmodel.predict_proba(X[oneindexes])
In [100]:
zerosamples = rfmodel.predict_proba(X[zeroindexes])
In [101]:
f = open("random.forest.bayes.dist.samples.pickle","wb")
pickle.dump({1:onesamples,0:zerosamples},f)
f.close()